feat: implement DataWriter for Iceberg data files #552
shangxinli wants to merge 1 commit into apache:main
Conversation
Force-pushed from 8944a75 to a201953
src/iceberg/data/data_writer.cc
Outdated
```cpp
ICEBERG_ASSIGN_OR_RAISE(writer_,
                        WriterFactoryRegistry::Open(options_.format, writer_options));
return {};
```
It is odd that an empty structure is always returned. Also, since this is initialization, why not do it in the ctor?
Refactored the initialization logic
```cpp
if (closed_) {
  return InvalidArgument("Writer already closed");
}
```
I could see a case for making Close() idempotent. Is there any strong reason to return this error instead of, for example, a no-op?
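For concreteness, a minimal sketch of the idempotent variant, reusing the Status machinery and the closed_/writer_ members quoted above; this reflects the reviewer's suggestion, not necessarily the exact merged code:

```cpp
// Sketch: idempotent Close(). A second call becomes a successful no-op
// instead of returning InvalidArgument.
Status Close() {
  if (closed_) {
    return {};  // already closed: nothing to do
  }
  ICEBERG_RETURN_UNEXPECTED(writer_->Close());
  closed_ = true;
  return {};
}
```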
```cpp
    return InvalidArgument("Writer already closed");
  }
  ICEBERG_RETURN_UNEXPECTED(writer_->Close());
  closed_ = true;
```
Should this class address thread safety?
Good question! I've added explicit documentation that this class is not thread-safe.
I don't think a single writer (or reader) should support thread safety, so it is fine not to add a comment like this.
src/iceberg/test/data_writer_test.cc
Outdated
```cpp
TEST_F(DataWriterTest, CreateWithParquetFormat) {
  DataWriterOptions options{
      .path = "test_data.parquet",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kParquet,
      .io = file_io_,
      .properties = {{"write.parquet.compression-codec", "uncompressed"}},
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}

TEST_F(DataWriterTest, CreateWithAvroFormat) {
  DataWriterOptions options{
      .path = "test_data.avro",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kAvro,
      .io = file_io_,
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}
```
nit: The two tests are quite similar; it is probably possible to leverage a helper function to reduce duplication.
Consolidated the two tests using parameterized testing.
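For reference, one way such a consolidation can look with GoogleTest value-parameterized tests; the suite name and the (path, format) pairing below are illustrative assumptions, not the exact code from this PR:

```cpp
// Sketch: one value-parameterized test covering both formats. The
// fixture members and option fields mirror the tests quoted above.
class DataWriterFormatTest
    : public DataWriterTest,
      public ::testing::WithParamInterface<std::pair<std::string, FileFormatType>> {};

TEST_P(DataWriterFormatTest, CreateWriter) {
  const auto& [path, format] = GetParam();
  DataWriterOptions options{
      .path = path,
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = format,
      .io = file_io_,
  };
  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  ASSERT_NE(writer_result.value(), nullptr);
}

INSTANTIATE_TEST_SUITE_P(
    Formats, DataWriterFormatTest,
    ::testing::Values(std::make_pair("test_data.parquet", FileFormatType::kParquet),
                      std::make_pair("test_data.avro", FileFormatType::kAvro)));
```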
```cpp
// Check length before close
auto length_result = writer->Length();
ASSERT_THAT(length_result, IsOk());
EXPECT_GT(length_result.value(), 0);
```
nit: check the size of the data passed to the write function?
src/iceberg/data/data_writer.cc
Outdated
```cpp
if (!writer_) {
  return InvalidArgument("Writer not initialized");
}
```
Suggested change:
```diff
-    if (!writer_) {
-      return InvalidArgument("Writer not initialized");
-    }
+    ICEBERG_PRECHECK(writer_, "Writer not initialized");
```
nit: this should make the code shorter.
Replaced all manual null checks with ICEBERG_PRECHECK
src/iceberg/data/data_writer.cc
Outdated
```cpp
  }

  Result<FileWriter::WriteResult> Metadata() {
    if (!closed_) {
```
nit: use ICEBERG_CHECK here
src/iceberg/test/data_writer_test.cc
Outdated
```cpp
  EXPECT_GT(length.value(), 0);
}

}  // namespace
```
nit: move this closing namespace brace before the first TEST_F?
Force-pushed from 90d324e to 153d763
Implements DataWriter class for writing Iceberg data files as part of issue apache#441 (task 2).

Implementation:
- Static factory method DataWriter::Make() for creating writer instances
- Support for Parquet and Avro file formats via WriterFactoryRegistry
- Complete DataFile metadata generation including partition info, column statistics, serialized bounds, and sort order ID
- Proper lifecycle management with Write/Close/Metadata methods
- Idempotent Close(): multiple calls succeed (no-op after first)
- PIMPL idiom for ABI stability
- Not thread-safe (documented)

Tests:
- 13 comprehensive unit tests including parameterized format tests
- Coverage: creation, write/close lifecycle, metadata generation, error handling, feature validation, and data size verification
- All tests passing (13/13)

Related to apache#441
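For readers following along, the PIMPL shape described above looks roughly like this; the signatures are inferred from the snippets quoted in this thread rather than copied from the patch:

```cpp
// Sketch: the public DataWriter forwards to an Impl defined only in
// data_writer.cc, so internal members can change without breaking ABI.
class DataWriter {
 public:
  static Result<std::unique_ptr<DataWriter>> Make(DataWriterOptions options);
  ~DataWriter();

  Status Write(ArrowArray* data);              // append a batch of rows
  Status Close();                              // idempotent
  Result<FileWriter::WriteResult> Metadata();  // only valid after Close()

 private:
  class Impl;  // defined in data_writer.cc
  std::unique_ptr<Impl> impl_;
};
```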
Force-pushed from 153d763 to 147f25b
```cpp
class DataWriter::Impl {
 public:
  static Result<std::unique_ptr<Impl>> Make(DataWriterOptions options) {
    WriterOptions writer_options;
```
nit: use aggregate initialization for writer_options
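A sketch of what the reviewer means, with hypothetical WriterOptions field names (the real field set lives in the writer headers):

```cpp
// Sketch: designated initializers instead of default construction plus
// field-by-field assignment. Field names here are assumptions.
WriterOptions writer_options{
    .path = options.path,
    .schema = options.schema,
    .io = options.io,
    .properties = options.properties,
};
```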
```cpp
  }

  Status Write(ArrowArray* data) {
    ICEBERG_PRECHECK(writer_, "Writer not initialized");
```
Will this check ever fail? If not, should we remove the check or use ICEBERG_DCHECK instead? Same question for below.
```cpp
  }

  Result<FileWriter::WriteResult> Metadata() {
    ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
```
Suggested change:
```diff
-    ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
+    ICEBERG_CHECK(closed_, "Cannot get metadata before closing the writer");
```
We should return an invalid-state error instead of an invalid-argument error in this case.
```cpp
    data_file->file_path = options_.path;
    data_file->file_format = options_.format;
    data_file->partition = options_.partition;
    data_file->record_count = metrics.row_count.value_or(0);
```
Suggested change:
```diff
-    data_file->record_count = metrics.row_count.value_or(0);
+    data_file->record_count = metrics.row_count.value_or(-1);
```
Java impl uses -1 when row count is unavailable.
```cpp
    auto split_offsets = writer_->split_offsets();

    auto data_file = std::make_shared<DataFile>();
    data_file->content = DataFile::Content::kData;
```
nit: use aggregate initialization
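Applied to the snippet above, and assuming DataFile is an aggregate whose fields are declared in this order, that could look like:

```cpp
// Sketch: build the DataFile in one expression. Field names come from
// the quoted code; the declaration order is an assumption.
auto data_file = std::make_shared<DataFile>(DataFile{
    .content = DataFile::Content::kData,
    .file_path = options_.path,
    .file_format = options_.format,
    .partition = options_.partition,
    .record_count = metrics.row_count.value_or(0),
});
```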
```cpp
    // Convert metrics maps from unordered_map to map
    for (const auto& [col_id, size] : metrics.column_sizes) {
      data_file->column_sizes[col_id] = size;
```
Do you think it makes sense to change DataFile and Metrics classes to use std::map or std::unordered_map consistently so we don't need to use a for-loop here?
cc @zhjwpku
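Until the container types are unified, the per-entry loop can at least be collapsed into the iterator-range constructor; a sketch, with the int32_t/int64_t key and value types assumed:

```cpp
// Sketch: copy an unordered_map into an ordered std::map in one
// expression instead of a per-entry loop. Key/value types assumed.
data_file->column_sizes = std::map<int32_t, int64_t>(
    metrics.column_sizes.begin(), metrics.column_sizes.end());
```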
```cpp
      SchemaField::MakeRequired(1, "id", std::make_shared<IntType>()),
      SchemaField::MakeOptional(2, "name", std::make_shared<StringType>())});
```
Suggested change:
```diff
-      SchemaField::MakeRequired(1, "id", std::make_shared<IntType>()),
-      SchemaField::MakeOptional(2, "name", std::make_shared<StringType>())});
+      SchemaField::MakeRequired(1, "id", int32()),
+      SchemaField::MakeOptional(2, "name", string())});
```
```cpp
using ::testing::HasSubstr;

class DataWriterTest : public ::testing::Test {
```
Can we try to consolidate the test cases? Each of them only tests a tiny API with repeated boilerplate for creating a writer and writing data, which may lead to a test-case explosion if more cases like this are added.
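One possible shape for that consolidation, sketched under the assumption that the fixture already owns schema_, partition_spec_, and file_io_; MakeWriter is a hypothetical helper name, not code from this PR:

```cpp
// Sketch: concentrate the writer-creation boilerplate in the fixture so
// each test body only states what it actually checks.
class DataWriterTest : public ::testing::Test {
 protected:
  Result<std::unique_ptr<DataWriter>> MakeWriter(
      std::string path, FileFormatType format,
      std::unordered_map<std::string, std::string> properties = {}) {
    DataWriterOptions options{
        .path = std::move(path),
        .schema = schema_,
        .spec = partition_spec_,
        .partition = PartitionValues{},
        .format = format,
        .io = file_io_,
        .properties = std::move(properties),
    };
    return DataWriter::Make(options);
  }
  // schema_, partition_spec_, file_io_ initialized as in the original fixture.
};
```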